We describe the design and early functionality of the
Active Harmony global resource management system.
Harmony is an infrastructure designed to efficiently
execute parallel applications in large-scale, dynamic
environments.

Harmony differs from other projects with similar goals
in that the system automatically adapts ongoing
computations to changing conditions through online
reconfiguration. This reconfiguration can consist of
system-directed migration of work at several different
levels, or automatic application adaptation through the
use of tuning options exported by Harmony-aware
applications.

We  describe  early  experience  with  work  migration  at  the 
level  of  procedures,  processes  and  lightweight  threads. 

1.  Introduction 

Meta-computing,  the  simultaneous  and  coordinated  use 
of  semi-autonomous  computing  resources  in  physically 
separate  locations,  is  increasingly  being  used  to  solve 
large-scale  scientific  problems.  By  using  a  collection  of 
specialized  computational  and  data  resources  located  at 
different  facilities  around  the  world,  work  can  be  done 
more  efficiently  than  if  only  local  resources  were  used. 
However,  the  infrastructure  to  support  this  type  of 
global-scale  computation  is  not  yet  available. 

Both meta-computer environments and the applications
that run on them can be characterized by distribution,
heterogeneity, and changing resource requirements
and capacities. These attributes make static approaches
to resource allocation unsuitable. Systems need to
dynamically adapt to changing resource capacities and
application requirements in order to achieve high
performance in such environments.

We  are  designing  and  building  Active  Harmony,  a 
software  architecture  that  manages  distributed  execution 
of  computational  objects  in  such  environments.  Our 
primary  focus  is  on  the  following  three  areas: 

• Support for dynamic execution environments:
Dynamic adaptation to network and resource
capacities, both when computational objects are
created, and when application requirements or resource
capacities change. Active Harmony attempts to
maximize data affinity (moving computation near
its data) and load balancing through intelligent
resource allocation and object migration. Harmony
supports an extensible metric interface that permits
the sharing of resource capacity and utilization
information among components in a distributed
system.

• Application adaptation: A measurement and feedback
system that adapts computational objects and
their execution environment to improve the overall
performance of the distributed system via runtime
adaptation of algorithms, data distribution, and load
balancing. Active Harmony exports a detailed metric
interface to applications, allowing them to access
processor, network, and operating system parameters.
Applications export tuning options to the system,
which can then automatically optimize resource
allocation. Measurement and tuning therefore
become first-class objects in the programming
model. Programmers can write applications that
include ways to adapt computation to observed
performance and changing conditions.

• Shared-data interfaces: Active Harmony supports
single-system semantics among computational
objects regardless of location, including consistent
shared data segments. Shared data segments allow
both peer-to-peer and client-server computations to
exploit the simplified programming model and
fine-grained sharing permitted by shared-memory
environments. Innovations include support for
heterogeneity of both data and program code, and support
for the dynamic execution environment. Harmony's
shared data interface also includes methods of
measuring the data affinity between arbitrary
objects.

The Active Harmony system is targeted at long-lived
and persistent applications. Examples of long-lived
applications include scientific code and data mining
applications. Persistent applications include file servers,
information servers, and database management systems.
We target long-lived applications because they persist
long enough for the global environment to change, and
hence have higher potential for improvement.

Our emphasis on long-lived applications allows us
to focus on relatively expensive operations such as object
migration. The cost of these operations can be amortized
across the life of the object. Additionally, to implement
a feedback loop that does not oscillate out of control,
changes must be made gradually so that the effects of
one change can settle before other changes are made.

Inherent to our assumption that we can adapt
long-lived parallel applications is the belief that the past
performance of an application is a good predictor of future
behavior. We feel this is true because many programs
tend to be cyclic or iterative in nature. As a result, if we
can measure a program for one cycle, we can then adapt
it for the next cycle and hope to improve its
performance. Examples of applications that have this behavior
include most scientific codes and periodic requests to
information and database servers.

Resource management in meta-computing
environments is a complex task. This paper focuses on the
problem of developing metrics to measure changing
computational needs and the techniques to react to them.
Section 2 describes the overall architecture of the
Harmony system. Section 3 describes Harmony's use of
LBF, a performance metric designed to predict the
impact of migrating both process-level and procedure-level
computation tasks. Section 4 describes Harmony's
approach to communication minimization through thread
migration. Finally, Section 5 discusses related work and
Section 6 concludes.

2.  Scheduling  and  adaptation 

Traditional approaches to resource management, such as
space sharing and gang scheduling, rely on the system
being static and under the full control of a centralized
scheduler. They also rely on all system jobs being
homogeneous, at least to the level of the types of
information that can be obtained from the jobs. Harmony's
target environment is dynamic, heterogeneous, and can
contain sub-systems that are not fully under Harmony's
control. Furthermore, we intend to exploit
application-specific information to improve resource utilization.

For example, the underlying system can obtain
sharing information about threads in distributed shared
memory (DSM) applications. Another application might
explicitly export tuning options to the system. These
options might allow the system to adapt the application
to available memory and CPU resources. Such disparate
types of information present inherently different
challenges and opportunities to schedulers, and could
probably be best handled in a decentralized fashion. However,
global schedulers provide a single point where broad
policy decisions can be made.

2.1  Adaptation  policies 

The primary motivation of Harmony's application and
environment characterizations is their use in improving
resource utilization and throughput. Traditional
approaches to scheduling, such as space sharing and gang
scheduling, rely on the system being static and under the
full control of the scheduler. Our target environment is
neither. Among the issues that we address are the
following:

1) centralized versus decentralized control - Centralized
control usually implies better decision-making
because of more complete information.
However, up-to-date information is expensive to
collect in large or dynamic systems. Since our target
environments are both, we expect the ability to
profitably use slightly stale system information to be
important.

2) data-shipping versus function-shipping - At any
point where computational objects interact with
either remote data sources or objects, the system can
potentially improve performance by migrating the
objects closer together. The system must determine
when this is appropriate, and in what fashion any
migration would be accomplished.

3) throughput versus response time - While throughput
is likely to be the most important measure of
system performance, the application mix will
include applications for which response time is
important, and the system might also be forced to
coexist with applications that are both outside the
control of the system, and whose performance must
not be degraded.

Harmony supports both domain-specific decision
processes and global policies through hierarchical resource
management. The adaptation controller exports broad
policy choices through a domain-independent interface
to domain schedulers. Similarly, domain-independent
resource and capacity information is passed from the
domain schedulers up to the adaptation controller. Each
domain scheduler translates global policies into policies
specific to the type of application it controls.

2.2 Harmony's structure

Harmony's structure is shown in Figure 1. The major
components are the following:

Adaptation  controller 

The adaptation controller is the heart of the system. The
controller must gather relevant information to be used as
input, project the effects of proposed changes (such as
migrating an object) on the system, and weigh
competing costs and expected benefits of making various
changes.

The design of the adaptation controller has yet to be
finalized. However, the system will probably use a
pattern-matching approach. Patterns will be used to categorize
potential problems into specific problem domains, and to
apply sophisticated domain-specific optimizations. For
example, if the adaptation controller contained a
predicate that indicates the system is suffering from poor load
balance, the pattern that recognizes the situation might
trigger one or more load balancing algorithms.
Together, the problem-recognition patterns and the
domain-specific techniques will form a complete resource
management system.

Active Harmony provides mechanisms for applications
to export tuning options, together with information
about the resource requirements of each option, to the
adaptation controller. The adaptation controller then
chooses among the exported options based on more
complete information than is available to individual
objects. A key advantage of this technique is that the
system can tune not just individual objects, but also
entire collections of objects. Possible tuning criteria
include network latency and bandwidth, memory
utilization, and processor time. Since changing
implementations or data layout could require significant time, we
propose to include a cost function that can be used by
the tuning system to evaluate if a tuning option is worth
the effort required.
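
The text proposes a cost function but does not define one. As a minimal sketch, assuming a hypothetical model in which an option is worthwhile only if the time it saves over the remaining iterations exceeds its one-time switching cost, the decision could look like the following. The names `TuningOption`, `net_benefit`, and `choose_option` are illustrative and are not part of Harmony.

```python
from dataclasses import dataclass


@dataclass
class TuningOption:
    """Hypothetical description of one exported tuning option."""
    name: str
    predicted_time_per_iter: float  # seconds per iteration once in effect
    switch_cost: float              # one-time seconds to change implementation or layout


def net_benefit(current_time_per_iter, option, remaining_iters):
    """Time saved over the remaining iterations, minus the one-time switching cost."""
    saved = (current_time_per_iter - option.predicted_time_per_iter) * remaining_iters
    return saved - option.switch_cost


def choose_option(current_time_per_iter, options, remaining_iters):
    """Return the option with the largest positive net benefit, or None to stay put."""
    best, best_gain = None, 0.0
    for option in options:
        gain = net_benefit(current_time_per_iter, option, remaining_iters)
        if gain > best_gain:
            best, best_gain = option, gain
    return best


options = [TuningOption("new_layout", predicted_time_per_iter=0.8, switch_cost=30.0)]
short_run = choose_option(1.0, options, remaining_iters=100)    # too few iterations left
long_run = choose_option(1.0, options, remaining_iters=1000)    # cost can be amortized
```

Under this assumed model the same option is rejected for a short remaining run and accepted for a long one, which matches the paper's argument that expensive reconfigurations pay off mainly for long-lived applications.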

Metric  interface 

The metric interface provides a unified way to gather
data about the performance of applications and their
execution environment. Data about system conditions
and application resource requirements flow into the
metric interface, and on to both the adaptation controller
and individual applications.

Tuning  interface 

The tuning interface provides a method for applications
to export tuning options to the system. Each tuning
option defines the expected consumption of one or more
system resources. The options are intended to be
“knobs” that the system can use to adjust applications to
changes in the environment. The main concern in
designing the tuning interface is to ensure that it is
expressive enough to describe the effects of all application
tuning options.
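
To make the idea of options as "knobs" concrete, the sketch below assumes a hypothetical representation in which each option lists its expected resource consumption and an expected running time, and the system picks the fastest option that fits the currently available resources. The resource names (`memory_mb`, `cpus`) and the functions `feasible` and `pick_option` are assumptions for illustration only; the paper does not specify the interface.

```python
def feasible(needs, available):
    """True if every resource the option expects to consume is available."""
    return all(available.get(resource, 0.0) >= amount
               for resource, amount in needs.items())


def pick_option(options, available):
    """Pick the fastest feasible option.

    options: list of (name, needs, expected_time) tuples, where needs maps a
    resource name to the amount the option expects to consume.
    Returns the chosen option name, or None if nothing fits.
    """
    candidates = [(expected_time, name)
                  for name, needs, expected_time in options
                  if feasible(needs, available)]
    return min(candidates)[1] if candidates else None


options = [
    ("in_memory", {"memory_mb": 512, "cpus": 4}, 10.0),
    ("out_of_core", {"memory_mb": 64, "cpus": 1}, 25.0),
]
constrained = pick_option(options, {"memory_mb": 128, "cpus": 2})
roomy = pick_option(options, {"memory_mb": 1024, "cpus": 8})
starved = pick_option(options, {"memory_mb": 32, "cpus": 1})
```

When resources shrink, the selection falls back to the cheaper option, which is the kind of adjustment to environmental change that the tuning interface is meant to enable.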

3.  Adaptation  metrics 

To make informed choices about adapting an
application, Harmony needs metrics to predict the performance
implications of any changes. To meet this need, we have
developed a metric called Load Balancing Factor (LBF)
to predict the impact of changing where computation is
performed. This metric can be used by the system to
evaluate potential application reconfigurations before
committing to potentially poor choices.

We have developed two variants of LBF, one for
process-level migration, and one for fine-grained
procedure-level migration. Process Load Balancing Factor
(process LBF) predicts the impact of changing the assignment of
processes to processors in a distributed execution
environment. Our goal is to compute the potential
improvement in execution time if we change the placement. Our
technique can also be used to predict the performance
gain possible if new nodes are added. Also, we are able
to predict how the application would behave if the
performance characteristics of the communication system
were to change.

Figure 1: Major Components of Active Harmony

To assess the potential improvement, we predict the
execution time of a program with a virtual placement,
during an execution on a different one. Our approach is
to instrument application processes to forward data
about inter-process events to a central monitoring
station that simulates the execution of these events under
the target configuration.

The details of the algorithm for process-level LBF
are described in [1]. Early experience with process LBF
is encouraging. Figure 2 shows a summary of the
measured and predicted performance for a TSP application
and four of the NAS benchmark programs [2]. For each
application, we show the measured running time for one
or two target configurations and the running time
predicted for each target from runs under different actual
configurations. In all cases, we are able to predict the
running time to within 7% of the measured time; the
largest error, 6.6%, occurs for IS.

While process LBF is designed for coarse-grained
migration, procedure-level LBF is designed to measure
the impact of fine-grained movement of work. The goal of
this metric is to compute the potential improvement in
execution time if we move a selected procedure, F, from
the client to the server or vice versa.


Application    Target  Meas.   Actual  Pred.   Error          Actual  Pred.   Error
                       Time
TSP            4/1     85.6    4/1     85.5    0.1 (0.1%)     4/4     85.9    -0.3 (-0.4%)
               4/1     199.2   4/1     197.1   2.1 (1.1%)     4/4     198.9   0.3 (0.2%)
EP - class A   16/16   258.2   16/8    255.6   2.6 (1.0%)     16/16   260.7   -2.5 (-1.0%)
FT - class A   16/16   140.9   16/8    139.2   1.7 (1.2%)     16/16   140.0   0.9 (0.6%)
IS - class A   16/16   271.2   16/8    253.3   17.9 (6.6%)    16/16   254.7   16.5 (6.0%)
MG - class A   16/16   172.8   16/8    166.0   6.8 (4.0%)     16/16   168.5   4.3 (2.5%)

Figure 2: Measured and predicted time for LBF.
For each application, we show 1-2 target configurations.
The second column shows the measured time
running on this target configuration. The rest of the
table shows the execution times predicted by LBF
when run under two different actual configurations.

The algorithm used to compute procedure LBF is based on
the Critical Path (CP) of a parallel computation (the
longest process-time-weighted path through the graph
formed by the inter-process communication in the
program). The idea of procedure LBF is to compute the new
CP of the program if the selected procedure were moved
from one process to another (see footnote 2).

In each process, we keep track of the original CP
and the new CP due to moving the selected procedure.
We compute procedure LBF at each message exchange.
At a send event, we subtract the accumulated time of the
selected procedure from the CP of the sending process,
and send the accumulated procedure time along with the
application message. At a receive event, we add the
passed procedure time to the CP value of the receiving
process before the receive event. The value of the
procedure LBF metric is the total effective CP value at the
end of the program's execution. Procedure LBF only
approximates the execution time with migration since
we ignore many subtle issues such as global data
references by the “moved” procedure. Figure 3 shows the
computation of procedure LBF for a single message
send. Our intent with this metric is to supply initial
feedback to the programmer about the potential of a
tuning alternative. A more refined prediction that
incorporates shared data analysis could be run after our
metric but before proceeding to a full implementation.
Figure 3: Computing procedure LBF - The PAG
before and after moving the procedure F (panels:
Before “moving” F; After “moving” F). The time
for the procedure F is moved from the sending
process (which is on the application’s critical path) to
the receiving one (which is not).
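
The send and receive rules described above can be sketched as a small trace-driven computation. This is a simplified illustration, not the authors' implementation: it assumes a hypothetical trace format, zero message latency, and messages delivered at the point where they appear in the trace. The function name `procedure_lbf` and the event tuples are assumptions.

```python
from collections import defaultdict


def procedure_lbf(trace):
    """Return (original CP length, CP length with the selected procedure moved).

    trace is a list of events in a consistent global order:
      ("compute", process, duration, in_selected_procedure)
      ("msg", sender, receiver)
    """
    orig_cp = defaultdict(float)   # per-process original critical path value
    new_cp = defaultdict(float)    # per-process CP value with the procedure moved
    pending = defaultdict(float)   # selected-procedure time accumulated since last send

    for event in trace:
        if event[0] == "compute":
            _, proc, duration, in_selected = event
            orig_cp[proc] += duration
            new_cp[proc] += duration
            if in_selected:
                pending[proc] += duration
        elif event[0] == "msg":
            _, src, dst = event
            # Original CP: the receiver waits for the sender.
            orig_cp[dst] = max(orig_cp[dst], orig_cp[src])
            # Send: subtract the accumulated procedure time from the sender's CP
            # and pass it along with the message.
            moved = pending[src]
            pending[src] = 0.0
            new_cp[src] -= moved
            # Receive: add the passed time to the receiver's CP, then merge paths.
            new_cp[dst] += moved
            new_cp[dst] = max(new_cp[dst], new_cp[src])
        else:
            raise ValueError(f"unknown event type: {event[0]!r}")

    return max(orig_cp.values()), max(new_cp.values())


# A server on the critical path runs the selected procedure F for 5 time units.
trace = [
    ("compute", "server", 2.0, False),
    ("compute", "server", 5.0, True),    # time spent in F
    ("compute", "client", 1.0, False),
    ("msg", "server", "client"),
    ("compute", "client", 1.0, False),
]
original, moved = procedure_lbf(trace)
```

In this example the original critical path has length 8.0, and charging F to the otherwise idle receiver shortens it to 7.0, mirroring the situation drawn in Figure 3.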

We created a Synthetic Parallel Application (SPA) that
demonstrates a workload where a single server becomes
the bottleneck responding to requests from three clients.
In the server, two classes of requests are processed:
servBusy1 and servBusy2. ServBusy1 is the service
requested by the first client and servBusy2 is the service
requested by the other two clients.


2  Our  metric  does  not  evaluate  how  to  move  the  procedure. 
However,  this  movement  is  possible  if  the  application  uses 
Harmony's  shared  data  programming  model. 


The results of computing procedure LBF for the
synthetic parallel application are shown in Figure 4. To
validate these results, we created two modified versions
of the synthetic parallel application (one with each of
servBusy1 and servBusy2 moved from the server to the
clients) and measured the resulting execution time (see
footnote 3). The results of the modified programs are
shown in the third column of Figure 4. In both cases, the
error is small, indicating that our metric has provided
good guidance to the application programmer.


Procedure    Procedure LBF   Measured Time   Difference
ServBusy1    25.3            25.4            0.1 (0.4%)
ServBusy2    23.0            23.1            0.1 (0.6%)

Figure 4: Procedure LBF accuracy.

For comparison to an alternative tuning option, we
also show the value for the Critical Path Zeroing metric
[3]. CP Zeroing is a metric that predicts the
improvement possible due to optimally tuning the selected
procedure (i.e., reducing its execution time to zero) by
computing the length of the critical path resulting from
setting the time of the selected procedure to zero. We
compare LBF with Critical Path Zeroing because it is
natural to consider improving the performance of a
procedure itself as well as changing its execution place
(processor) as tuning strategies.

The length of the new CP obtained by zeroing
servBusy1 is 25.4, and the length obtained by zeroing
servBusy2 is 16.1, while the length of the original CP is
30.7 (Figure 5). Comparing these values with procedure
LBF, we achieve almost the same benefit as tuning the
procedure ServBusy1 by simply moving it from the
server to the client. Likewise, we achieve over one-half
the benefit of tuning the ServBusy2 procedure by
moving it to the client side.


Procedure    LBF    Improvement   CP Zeroing   Improvement
ServBusy1    25.3   17.8%         25.4         17.4%
ServBusy2    23.1   25.1%         16.1         47.5%

Figure 5: Procedure LBF vs. CP Zeroing.

4.  Adaptation  via  thread  migration 

LBF allows Harmony to evaluate the computational
effects of moving procedures and processes among distinct
nodes. This section discusses Harmony's ability to
gather and use somewhat analogous information from
shared-memory applications.

3 Since Harmony's shared data programming model is not yet
fully implemented, we made these changes by hand.

Harmony provides a shared memory abstraction to
parallel applications running on networks of
workstations. Systems with such support are commonly termed
distributed shared memory (DSM) systems. DSM
applications are multi-threaded, and assumed to have many
more threads than the number of nodes used by any one
application. Overall performance depends on
parallelism, load balance, latency tolerance, and communication
minimization. In this paper, we focus on communication
minimization.

Communication results primarily from data sharing
between threads. Hence, co-locating communicating
threads on the same nodes can reduce communication.
Harmony obtains sharing information through an active
correlation-tracking mechanism [4]. Previous systems
obtained page-level access information by tracking
existing page faults. Page faults occur when local threads
access invalid shared pages. Invalid pages are
revalidated by fetching the latest version of the shared
page from the last node that modified it. The underlying
system keeps track of the thread that caused each page
fault, slowly building up a pattern of the pages accessed
by each thread. The correlation between a pair of
threads is the number of shared pages that they access in
common.
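
The correlation definition above translates directly into a set intersection. The following is a minimal sketch; the function name `correlation_map` and the dictionary-based representation of per-thread page sets are assumptions made for illustration.

```python
from itertools import combinations


def correlation_map(pages_by_thread):
    """Map each unordered thread pair to the number of shared pages both access.

    pages_by_thread: dict mapping a thread id to the set of shared pages
    that the thread was observed to access.
    """
    corr = {}
    for a, b in combinations(sorted(pages_by_thread), 2):
        corr[(a, b)] = len(pages_by_thread[a] & pages_by_thread[b])
    return corr


pages = {0: {1, 2, 3}, 1: {2, 3, 4}, 2: {9}}
corr = correlation_map(pages)
```

Here threads 0 and 1 have a correlation of 2 (pages 2 and 3), while thread 2 shares nothing with either of them.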

The problem is that there are usually multiple
threads running on each machine, and these threads
share state. Once the first thread on a node revalidates a
given page, all other local threads can access the page
without invoking the DSM system.

Hence, the system only gains partial information
about the sharing behavior of local threads. Any
migration decisions are therefore made with only partial
information, often leading to bad long-term choices. Bad
choices are discovered only after the threads have been
migrated to other processors. Once a thread migrates off
of a local host, the interactions between that thread and
those left behind become visible in the form of network
page faults (unless masked by the actions of other
threads on the new node). These faults can be used to
identify threads that should then be moved back to their
original position, resulting in ping-ponging of threads
across the system.

Active  correlation-tracking  avoids  these  problems 
through  a  one-time  correlation-tracking  phase.  Briefly, 
the  algorithm  is  as  follows: 

1) All pages are marked invalid.

2) At each access fault caused by the above step (a
correlation fault), the page is noted and the page's
protection is returned to its original state.

3) At the next barrier, untouched pages are returned
to their previous protection level.

After the tracking phase has ended, each processor has a
complete record of all pages accessed by local threads.

Figure 6: Passive information-gathering
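
The three tracking steps can be illustrated with a small simulation. This is only a sketch: a real DSM system would change page protections through the operating system, whereas here protections are plain strings. It also assumes, as one plausible reading that the text does not spell out, that the steps are applied separately for each local thread so that faults can be attributed to individual threads. The names `track_thread` and `track_node` are hypothetical.

```python
def track_thread(original_protection, accesses):
    """Simulate one thread's tracking interval between two barriers.

    original_protection: dict page -> protection string before tracking
    accesses: sequence of pages the thread touches during the interval
    Returns (set of pages touched, number of correlation faults taken).
    """
    # Step 1: mark all pages invalid.
    protection = {page: "invalid" for page in original_protection}
    pending = set(original_protection)   # pages still carrying the tracking mark
    touched, faults = set(), 0

    for page in accesses:
        if page in pending:
            # Step 2: correlation fault; note the page and restore its protection.
            faults += 1
            touched.add(page)
            pending.discard(page)
            protection[page] = original_protection[page]

    # Step 3: at the barrier, restore the pages that were never touched.
    for page in pending:
        protection[page] = original_protection[page]
    assert protection == original_protection
    return touched, faults


def track_node(original_protection, accesses_by_thread):
    """Assumed per-thread application of the steps; returns thread -> pages touched."""
    return {thread: track_thread(original_protection, accesses)[0]
            for thread, accesses in accesses_by_thread.items()}


protections = {0: "read", 1: "write", 2: "read", 3: "invalid"}
record = track_node(protections, {"t0": [0, 1, 0, 1], "t1": [1, 2, 2]})
_, t0_faults = track_thread(protections, [0, 1, 0, 1])
```

Each page faults at most once per thread in this model, so the cost of tracking is bounded by the number of distinct pages a thread touches rather than by its total number of accesses.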

We measured the performance impact of
correlation-tracking on several standard DSM applications
(see footnote 4). The tracking phase's primary
overhead results from the correlation faults. However,
because the faults are incurred in parallel across all nodes
in the system, they cause an overhead of less than 20%.
Furthermore, this overhead only occurs during the
tracking phase. Since the tracking phase usually only
occurs once at the beginning of an application's
execution, this cost has a negligible effect on overall
performance.

Note that active correlation-tracking gives Harmony
complete sharing information without network
communication. Hence, a new configuration can be
implemented in only a single round of thread migrations. By
contrast, Figure 6 shows that the passive approach
requires an average of more than six rounds of mass
thread migrations before the amount of information
stabilizes. Each point shows the percentage of total
information learned by the passive approach at that round.
Even at the end of the migrations, the passive tracking
only comes close to obtaining complete information for
sor, by far the least complex of our applications.

Information obtained from the active
correlation-tracking mechanism is used to create correlation maps,
which summarize sharing information among all
threads in the system. Figure 7 shows a 16-thread
correlation map of a 3-D FFT with 64 by 64 by 16 data
points. The map shows a well-defined structure in which
all of the high peaks are concentrated along the
diagonal. Extension along the z axis represents the correlation
(in terms of the number of pages shared) between the
two threads identified by the x and y coordinates. The
majority of FFT's data sharing occurs inside four-thread
groupings. The lowest peak represents sharing among
threads 1-4, the next peak represents sharing among
threads 5-8, etc. This map implies that a mapping of
four threads to each of four processors would avoid
network communication for the majority of the sharing
between threads. However, an eight by two mapping
would exclude the majority of the peaks, implying that it
would cause much more communication. What is not
clear from the map is to what extent this communication
advantage would translate into a performance advantage.

4 Please see [5, 6] for details of our test applications.

Figure 7: 16-thread FFT

Given  complete  information  on  the  sharing  between 
threads,  Harmony  attempts  to  reduce  communication  by 
co-locating  thread  pairs  that  share  the  most  data. 

This problem is NP-complete, so we evaluated
several heuristics for mapping specific threads to nodes.
Briefly, we looked at both leader-based and leaderless
variants of AscEdge, DesEdge, and DesNode [4].
AscEdge treats threads and their communication (or
correlation) as nodes and edges of a weighted graph,
respectively. We map threads to nodes by sorting edges by
weight (communication cost) in ascending order. The
threads representing the endpoints of each edge are put
onto distinct nodes, if possible. Each thread is mapped
to the node with which the thread has the highest
aggregate correlation. DesEdge differs in that the edges are
sorted in descending order, and threads are placed on
the same nodes, if possible. DesNode sorts threads by
aggregate correlation, and maps each thread to the node
with which it has the highest aggregate correlation.
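
As a rough illustration of the descending-edge idea, the sketch below greedily co-locates the endpoints of the heaviest edges first, subject to an assumed per-node capacity. It is a simplified leaderless variant written for this text; the capacity rule, the tie-breaking, and the handling of leftover threads are assumptions, and the actual heuristics are those described in [4]. The names `des_edge` and `cut_cost` are hypothetical.

```python
def des_edge(num_threads, num_nodes, corr):
    """Greedy placement: visit edges by descending weight and co-locate endpoints.

    corr: dict mapping (thread_a, thread_b) to a correlation weight.
    Returns a dict mapping each thread id to a node id.
    """
    capacity = -(-num_threads // num_nodes)   # ceiling division (assumed balance rule)
    placement = {}
    load = [0] * num_nodes

    def place(thread, node):
        placement[thread] = node
        load[node] += 1

    for (a, b), _weight in sorted(corr.items(), key=lambda item: -item[1]):
        node_a, node_b = placement.get(a), placement.get(b)
        if node_a is None and node_b is None:
            # Start the pair on the least-loaded node with room for both threads.
            roomy = [n for n in range(num_nodes) if load[n] + 2 <= capacity]
            if roomy:
                target = min(roomy, key=lambda n: load[n])
                place(a, target)
                place(b, target)
        elif node_a is None and load[node_b] < capacity:
            place(a, node_b)
        elif node_b is None and load[node_a] < capacity:
            place(b, node_a)

    # Threads that could not join a partner go to the least-loaded node.
    for thread in range(num_threads):
        if thread not in placement:
            place(thread, min(range(num_nodes), key=lambda n: load[n]))
    return placement


def cut_cost(placement, corr):
    """Total correlation on edges whose endpoints sit on different nodes."""
    return sum(w for (a, b), w in corr.items() if placement[a] != placement[b])


# Eight threads in two tightly sharing groups of four, loosely sharing across groups.
corr = {(a, b): (10 if a // 4 == b // 4 else 1)
        for a in range(8) for b in range(a + 1, 8)}
placement = des_edge(8, 2, corr)
cost = cut_cost(placement, corr)
```

For this input the sketch places each four-thread group on its own node, leaving only the sixteen weight-1 cross-group edges cut.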

Figure 8 shows the communication costs that result
from running each of the heuristics on information
gathered through active correlation-tracking. The first
five columns give communication costs in the number of
pages shared by pairs of threads. The second set of five
columns gives the number of kilobytes communicated
per iteration, and the last five columns give the number
of messages sent per iteration. The heuristics AscEdge,
DesEdge, and DesNode are abbreviated ‘ae’, ‘de’, and
‘dn’, respectively. Leader-based variants are identified
by ‘-1’ suffixes. Additionally, we show the
communication costs of the optimal configuration (‘opt’), and
of a random configuration (‘r’).

There is a large amount of variation across the
different applications, but de-1 generally performs the best,
and random the worst. On average, de-1 obtains a
solution within 0.3% of optimal. This performance appears
to rest on two advantages. First, de-1 is usually able to
group threads connected by high-cost edges together on
the same nodes. Second, the leader-based approaches
appear to help ensure that the nodes are filled at the
same pace, rather than having the first node fill
completely, then the second, etc. The problem with filling
nodes unevenly is that it increases the chance that a
high-cost edge has to be split across nodes because of
one node being filled.

The relative communication cost magnitudes match
up quite well with the number of bytes communicated
and messages sent. However, the differences in
communication cost are exaggerated in the byte and message
totals. This implies that the page sharing addressed well
by all of the heuristics causes relatively less
communication than the pages that are handled better by some
heuristics than others.

5.  Related  work 

Process migration techniques have been investigated in
depth [7-9]. Recent work has investigated using
heterogeneous migration in conjunction with typesafe
languages, and in situations where the type-safety of
applications written in non-typesafe languages can be verified
[10]. Harmony is different in that we focus on the policy
issues of when to migrate large, long-running distributed
applications, and whether to migrate the process or the
data.

Farming computational objects to nodes of a distributed
system has been exploited by many projects [11-19].
Several projects also allow process migration across
homogeneous processor borders for purposes of load
balancing. Dome [18] also allows migration across
heterogeneous boundaries, but uses high-level checkpoints
consisting of user-written routines that marshal and
unmarshal significant data structures manually.
Additionally, Dome only supports applications based on a
library of parallel data structures; arbitrary parallel
applications are not supported. The Harmony project is
distinguished from these projects by its emphasis on
dynamic environments and by its integration of application
tuning options and semantic information into resource
management algorithms.

Several studies [7] claim that load balancing via process
migration is not a viable strategy for improving
performance, primarily because of large migration
overhead. We believe that two changes in current systems
make these results less relevant today than they were for
systems of a decade ago. First, computationally intensive
applications are now a mainstay on the type of
workstations we are considering in our work. These
applications have large resource demands, and sufficient
execution time to amortize the cost of process migration.
Second, the prevalence of parallel and client-server
applications means that data affinity becomes at least as
important as processor cycles for these applications.
Hence, there is more potential for improved performance
with current applications than with typical applications
of a decade ago.
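The amortization argument can be made concrete with a back-of-the-
envelope check. The sketch below assumes a deliberately simple model
that does not come from the paper: a job has a fixed amount of
remaining work, each host processes it at a constant rate, and
migration adds a one-time delay. All names and numbers are
illustrative.

```python
def migration_pays_off(remaining_work, local_rate, remote_rate, migration_cost):
    """Return True if migrating finishes sooner than staying put.

    remaining_work -- work units left to execute
    local_rate     -- work units per second on the current host
    remote_rate    -- work units per second on the candidate host
    migration_cost -- one-time delay, in seconds, to move the process
    """
    time_if_staying = remaining_work / local_rate
    time_if_moving = migration_cost + remaining_work / remote_rate
    return time_if_moving < time_if_staying

# Long-running job: 1000 s locally versus 60 s + 500 s after migrating.
print(migration_pays_off(1000, 1.0, 2.0, 60))   # True

# Short job: 100 s locally versus 60 s + 50 s after migrating.
print(migration_pays_off(100, 1.0, 2.0, 60))    # False
```

Under this model the same migration overhead is recovered by the
long-running job but not by the short one, which is the sense in
which long execution times amortize the cost.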

Globus [20] and Legion [21] also provide infrastructures
that support program execution in a meta-computing
environment. Harmony differs from these systems by
focusing on developing metrics and algorithms for program
adaptation, and by our use of shared-data interfaces as
part of the programming model. Globus and Legion provide
other essential services for meta-computing, including
naming, security, and communication. We are investigating
using one or both of these systems as a test-bed for the
resource management policies we are developing.

AppLeS [22] provides programmers with application-level
scheduling. Harmony differs in that our focus is on
middleware that supplies metrics and mechanisms to alter
the behavior of a program in execution. In AppLeS, the
application programmer is supplied information about the
computing environment [23] and given a library to let them
react to changes in available resources. In Harmony, we
are trying to have the application supply alternative ways
of executing the program and then let the runtime software
select among these options based on observations of the
environment.
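The division of labor described here, in which the application
supplies alternatives and the runtime chooses among them, can be
sketched as follows. This is a minimal illustration under stated
assumptions, not Harmony's interface: the `Alternative` class, the
`select_alternative` function, the environment keys, and the cost
formulas are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Env = Dict[str, float]  # hypothetical snapshot of observed conditions

@dataclass
class Alternative:
    """One way of executing the program, as exported by the application."""
    name: str
    predicted_seconds: Callable[[Env], float]

def select_alternative(alternatives: List[Alternative], observed: Env) -> Alternative:
    """Runtime side: pick the alternative with the lowest predicted time."""
    return min(alternatives, key=lambda alt: alt.predicted_seconds(observed))

# Application side: two illustrative ways of running a client-server query.
alternatives = [
    Alternative(
        "fetch_data_and_compute_locally",
        lambda env: env["data_mb"] / env["bandwidth_mb_s"]
        + env["work"] / env["local_speed"],
    ),
    Alternative(
        "compute_at_server",
        lambda env: env["work"] / env["server_speed"],
    ),
]

fast_net = {"data_mb": 100.0, "bandwidth_mb_s": 100.0, "work": 50.0,
            "local_speed": 10.0, "server_speed": 5.0}
slow_net = dict(fast_net, bandwidth_mb_s=1.0)

print(select_alternative(alternatives, fast_net).name)  # fetch_data_and_compute_locally
print(select_alternative(alternatives, slow_net).name)  # compute_at_server
```

The application code does not change between the two runs; only the
observed environment does, and the selection follows from it.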

6.  Summary  and  discussion 

We have described Active Harmony, a new infrastructure for
managing resources in large, dynamic environments. One of
Harmony's major strengths is its ability to gather and
exploit many types of application-specific information.
This information can then be used to automatically
reconfigure running applications at several levels. This
paper has concentrated on two such mechanisms. First, the
LBF metric predicts the performance impact of moving
procedures and processes. Second, active
correlation-tracking is used to gather data sharing
information and drive thread migration decision
heuristics.

However, Harmony's strength is not in the individual
mechanisms and metrics used, but in the interfaces that
allow global policy decisions and many types of
application-specific information to be tied together.
Harmony is a work in progress, and we are continuing to
evaluate new sources of information and new mechanisms for
possible inclusion in the system.